Dataset statistics
| Number of variables | 4 |
|---|---|
| Number of observations | 36282 |
| Missing cells | 2882 |
| Missing cells (%) | 2.0% |
| Duplicate rows | 5108 |
| Duplicate rows (%) | 14.1% |
| Total size in memory | 2.1 MiB |
| Average record size in memory | 62.1 B |
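The overview percentages can be re-derived from the counts reported elsewhere in this report. The following is a minimal sketch of that arithmetic; the variable names are illustrative and it assumes the percentages are taken over all cells (rows × variables) and all rows respectively.

```python
# Figures copied from this report; variable names are illustrative only.
n_rows = 36282
n_cols = 4
missing_per_column = {"q": 1468, "q_flag": 61, "w": 1353, "w_flag": 0}
duplicate_rows = 5108

# Missing cells is the sum of the per-variable missing counts.
missing_cells = sum(missing_per_column.values())

# Assumption: percentages are relative to all cells and all rows.
missing_pct = 100 * missing_cells / (n_rows * n_cols)
duplicate_pct = 100 * duplicate_rows / n_rows

print(missing_cells, round(missing_pct, 1), round(duplicate_pct, 1))
```

Under these assumptions the results agree with the table above after rounding to one decimal place.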
Variable types
| Numeric | 2 |
|---|---|
| Boolean | 2 |
Alerts
| Alert | Type |
|---|---|
| Dataset has 5108 (14.1%) duplicate rows | Duplicates |
| q is highly overall correlated with w | High correlation |
| w is highly overall correlated with q | High correlation |
| q_flag is highly overall correlated with w_flag | High correlation |
| w_flag is highly overall correlated with q_flag | High correlation |
| q has 1468 (4.0%) missing values | Missing |
| w has 1353 (3.7%) missing values | Missing |
Reproduction
| Analysis started | 2023-04-17 15:43:45.937748 |
|---|---|
| Analysis finished | 2023-04-17 15:43:52.715937 |
| Duration | 6.78 seconds |
| Software version | pandas-profiling v3.5.0 |
q
| Distinct | 1643 |
|---|---|
| Distinct (%) | 4.7% |
| Missing | 1468 |
| Missing (%) | 4.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 12.655559 |
| Minimum | 0.693 |
|---|---|
| Maximum | 281 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 1.6 MiB |
Quantile statistics
| Minimum | 0.693 |
|---|---|
| 5-th percentile | 2.44 |
| Q1 | 4.79 |
| median | 8.145 |
| Q3 | 14.6 |
| 95-th percentile | 38.335 |
| Maximum | 281 |
| Range | 280.307 |
| Interquartile range (IQR) | 9.81 |
Descriptive statistics
| Standard deviation | 14.107686 |
|---|---|
| Coefficient of variation (CV) | 1.1147423 |
| Kurtosis | 28.360781 |
| Mean | 12.655559 |
| Median Absolute Deviation (MAD) | 4.055 |
| Skewness | 3.9881842 |
| Sum | 440590.63 |
| Variance | 199.02682 |
| Monotonicity | Not monotonic |
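Several of the statistics for q are related by simple formulas, so they can be cross-checked against each other. The sketch below does that with the values copied from the tables above; it assumes the usual definitions (CV = standard deviation / mean, variance = standard deviation squared, range = maximum - minimum, IQR = Q3 - Q1, mean = sum / non-missing count). The report does not state which estimator conventions it uses, so the comparison is approximate.

```python
import math

# Values copied from the statistics tables for q.
n_rows, n_missing = 36282, 1468
mean, std = 12.655559, 14.107686
minimum, maximum = 0.693, 281
q1, q3 = 4.79, 14.6
total = 440590.63

n_valid = n_rows - n_missing          # observations that are not missing
cv = std / mean                       # coefficient of variation
variance = std ** 2
value_range = maximum - minimum
iqr = q3 - q1
mean_from_sum = total / n_valid       # mean recomputed from the reported sum

# Compare each derived value with the figure printed in the report.
checks = {
    "cv": math.isclose(cv, 1.1147423, rel_tol=1e-5),
    "variance": math.isclose(variance, 199.02682, rel_tol=1e-5),
    "range": math.isclose(value_range, 280.307, abs_tol=1e-9),
    "iqr": math.isclose(iqr, 9.81, abs_tol=1e-9),
    "mean": math.isclose(mean_from_sum, mean, rel_tol=1e-6),
}
print(checks)
```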
Histogram with fixed size bins (bins=50)
Common values
| Value | Count | Frequency (%) |
|---|---|---|
| 10.8 | 239 | 0.7% |
| 11.1 | 208 | 0.6% |
| 10.2 | 189 | 0.5% |
| 11.5 | 186 | 0.5% |
| 10.4 | 185 | 0.5% |
| 10.1 | 184 | 0.5% |
| 10.6 | 176 | 0.5% |
| 12.2 | 172 | 0.5% |
| 11.8 | 172 | 0.5% |
| 9.45 | 167 | 0.5% |
| Other values (1633) | 32936 | |
| (Missing) | 1468 | 4.0% |
Minimum 10 values
| Value | Count | Frequency (%) |
|---|---|---|
| 0.693 | 1 | < 0.1% |
| 0.767 | 1 | < 0.1% |
| 0.946 | 2 | < 0.1% |
| 1.03 | 1 | < 0.1% |
| 1.04 | 3 | < 0.1% |
| 1.13 | 8 | |
| 1.14 | 1 | < 0.1% |
| 1.19 | 1 | < 0.1% |
| 1.23 | 13 | |
| 1.24 | 1 | < 0.1% |
Maximum 10 values
| Value | Count | Frequency (%) |
|---|---|---|
| 281 | 1 | |
| 252 | 1 | |
| 208 | 1 | |
| 203 | 1 | |
| 202 | 1 | |
| 201 | 1 | |
| 195 | 1 | |
| 185 | 1 | |
| 182 | 1 | |
| 181 | 1 |
q_flag
Boolean
| Distinct | 2 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 61 |
| Missing (%) | 0.2% |
| Memory size | 1.6 MiB |
Common values
| Value | Count | Frequency (%) |
|---|---|---|
| False | 31473 | |
| True | 4748 | 13.1% |
| (Missing) | 61 | 0.2% |
w
| Distinct | 266 |
|---|---|
| Distinct (%) | 0.8% |
| Missing | 1353 |
| Missing (%) | 3.7% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 104.52504 |
| Minimum | 34 |
|---|---|
| Maximum | 385 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 1.6 MiB |
Quantile statistics
| Minimum | 34 |
|---|---|
| 5-th percentile | 49 |
| Q1 | 76 |
| median | 101 |
| Q3 | 127 |
| 95-th percentile | 175 |
| Maximum | 385 |
| Range | 351 |
| Interquartile range (IQR) | 51 |
Descriptive statistics
| Standard deviation | 39.1233 |
|---|---|
| Coefficient of variation (CV) | 0.37429597 |
| Kurtosis | 1.3424243 |
| Mean | 104.52504 |
| Median Absolute Deviation (MAD) | 25 |
| Skewness | 0.85669362 |
| Sum | 3650955 |
| Variance | 1530.6326 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
Common values
| Value | Count | Frequency (%) |
|---|---|---|
| 106 | 453 | 1.2% |
| 98 | 448 | 1.2% |
| 104 | 441 | 1.2% |
| 99 | 434 | 1.2% |
| 96 | 425 | 1.2% |
| 105 | 424 | 1.2% |
| 95 | 424 | 1.2% |
| 92 | 420 | 1.2% |
| 102 | 412 | 1.1% |
| 100 | 409 | 1.1% |
| Other values (256) | 30639 | |
| (Missing) | 1353 | 3.7% |
Minimum 10 values
| Value | Count | Frequency (%) |
|---|---|---|
| 34 | 1 | < 0.1% |
| 35 | 3 | < 0.1% |
| 36 | 4 | < 0.1% |
| 37 | 30 | 0.1% |
| 38 | 52 | 0.1% |
| 39 | 65 | |
| 40 | 83 | |
| 41 | 95 | |
| 42 | 134 | |
| 43 | 139 |
Maximum 10 values
| Value | Count | Frequency (%) |
|---|---|---|
| 385 | 1 | |
| 368 | 1 | |
| 342 | 2 | |
| 338 | 1 | |
| 337 | 1 | |
| 335 | 1 | |
| 325 | 2 | |
| 319 | 1 | |
| 315 | 1 | |
| 312 | 2 |
w_flag
Boolean
| Distinct | 2 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 1.3 MiB |
Common values
| Value | Count | Frequency (%) |
|---|---|---|
| False | 30438 | |
| True | 5844 | 16.1% |
Auto
The auto setting is an interpretable pairwise column metric that uses the following mapping (Variable_type-Variable_type : Method, Range):
- Categorical-Categorical : Cramer's V, [0,1]
- Numerical-Categorical : Cramer's V, [0,1] (using a discretized numerical column)
- Numerical-Numerical : Spearman's ρ, [-1,1]
This configuration uses the recommended metric for each pair of columns.
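The mapping above can be read as a small lookup on the pair of column types. The sketch below is only an illustration of that rule: `auto_method` and the type labels are hypothetical names chosen here, not part of the profiling library's API.

```python
def auto_method(type_a: str, type_b: str):
    """Return (method, value range) for a pair of column types.

    Hypothetical helper mirroring the mapping listed above; the type labels
    "Numerical" and "Categorical" are assumptions taken from that list.
    """
    kinds = {type_a, type_b}
    if kinds == {"Numerical"}:
        # Numerical-Numerical pairs use Spearman's rho.
        return "Spearman's rho", (-1, 1)
    if kinds <= {"Numerical", "Categorical"}:
        # Any pair involving a categorical column uses Cramer's V
        # (a numerical column would be discretized first).
        return "Cramer's V", (0, 1)
    raise ValueError(f"unsupported type pair: {type_a}, {type_b}")


print(auto_method("Numerical", "Numerical"))
print(auto_method("Numerical", "Categorical"))
print(auto_method("Categorical", "Categorical"))
```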
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
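The calculation just described (rank both variables, then divide the covariance of the ranks by the product of their standard deviations) can be sketched in plain Python. The helper names are illustrative, ties receive average ranks (an assumption, since the report does not say how ties are handled), and the data are the q and w values of the ten 2021-12-22 to 2021-12-31 rows shown in this report.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Covariance divided by the product of standard deviations.

    The 1/n factors cancel, so plain sums of deviations are used.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


q = [6.27, 5.89, 6.31, 12.90, 14.70, 17.00, 23.20, 36.60, 39.40, 34.70]
w = [68.0, 66.0, 69.0, 102.0, 110.0, 119.0, 138.0, 171.0, 176.0, 168.0]
rho = spearman(q, w)
print(rho)
```

In these ten rows w increases whenever q increases, so the two rank vectors are identical and ρ comes out at 1 up to floating-point rounding.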
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
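As a rough illustration of that formula and of the invariance property, the sketch below computes r for the q and w values of the ten 2021-12-22 to 2021-12-31 rows shown in this report, then again after shifting and rescaling both variables. The function name and the particular shift and scale factors are arbitrary choices made for the example.

```python
import math


def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


q = [6.27, 5.89, 6.31, 12.90, 14.70, 17.00, 23.20, 36.60, 39.40, 34.70]
w = [68.0, 66.0, 69.0, 102.0, 110.0, 119.0, 138.0, 171.0, 176.0, 168.0]

r = pearson(q, w)
# Change location and (positive) scale of each variable separately.
r_shifted = pearson([2 * v + 3 for v in q], [0.5 * v - 1 for v in w])
print(round(r, 3), round(r_shifted, 3))
```

The two printed values should agree, and r stays below 1 here because the relationship between q and w in these rows is monotonic but not exactly linear.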
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
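The pair-counting definition above translates directly into code. The sketch below implements exactly that formula (often called tau-a, with tied pairs counted as neither concordant nor discordant); the report does not state which variant it uses, and statistical libraries frequently apply a tie-adjusted version instead, so treat this as an illustration only. The data are the q and w values of the ten 2021-12-22 to 2021-12-31 rows shown in this report.

```python
from itertools import combinations


def kendall_tau(x, y):
    """(concordant - discordant) / total number of pairs, as described above."""
    concordant = discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        # A pair is concordant when x and y move in the same direction.
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # s == 0 means a tie in x or y; it counts toward neither total.
    return (concordant - discordant) / len(pairs)


q = [6.27, 5.89, 6.31, 12.90, 14.70, 17.00, 23.20, 36.60, 39.40, 34.70]
w = [68.0, 66.0, 69.0, 102.0, 110.0, 119.0, 138.0, 171.0, 176.0, 168.0]

tau = kendall_tau(q, w)
tau_reversed = kendall_tau(q, [-v for v in w])
print(tau, tau_reversed)
```

All 45 pairs in this sample are concordant, so τ is 1; negating w flips every pair to discordant and gives -1.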
Cramér's V (φc)
Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available for Phik.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
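One common way to quantify nullity correlation, assumed here because the report does not spell out its method, is to turn each column into a present/absent indicator and take the Pearson correlation of those indicators. The sketch below does this in plain Python; the function name is illustrative and the sample values are hypothetical, not taken from the dataset.

```python
import math


def nullity_correlation(col_a, col_b):
    """Pearson correlation between the presence indicators of two columns.

    None marks a missing value. Returns NaN when either column is entirely
    present or entirely missing, since its indicator then has no variance.
    """
    a = [0.0 if v is None else 1.0 for v in col_a]
    b = [0.0 if v is None else 1.0 for v in col_b]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical sample: q is missing wherever w is missing, plus once more.
q = [None, None, 6.3, 12.9, None, 17.0]
w = [None, 107.0, 69.0, 102.0, None, 119.0]
complete = [1, 2, 3, 4, 5, 6]

print(nullity_correlation(q, w))
print(nullity_correlation(q, complete))
```

A value near 1 means the two columns tend to be missing together, a value near -1 means one tends to be present when the other is missing, and a column without any missing values has no defined nullity correlation.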
First rows
| date | q | q_flag | w | w_flag |
|---|---|---|---|---|
| 1922-09-01 | NaN | NaN | 109.0 | False |
| 1922-09-02 | NaN | NaN | 107.0 | False |
| 1922-09-03 | NaN | NaN | 104.0 | False |
| 1922-09-04 | NaN | NaN | 102.0 | False |
| 1922-09-05 | NaN | NaN | 100.0 | False |
| 1922-09-06 | NaN | NaN | 98.0 | False |
| 1922-09-07 | NaN | NaN | 97.0 | False |
| 1922-09-08 | NaN | NaN | 96.0 | False |
| 1922-09-09 | NaN | NaN | 95.0 | False |
| 1922-09-10 | NaN | NaN | 95.0 | False |
Last rows
| date | q | q_flag | w | w_flag |
|---|---|---|---|---|
| 2021-12-22 | 6.27 | True | 68.0 | True |
| 2021-12-23 | 5.89 | True | 66.0 | True |
| 2021-12-24 | 6.31 | True | 69.0 | True |
| 2021-12-25 | 12.90 | True | 102.0 | True |
| 2021-12-26 | 14.70 | True | 110.0 | True |
| 2021-12-27 | 17.00 | True | 119.0 | True |
| 2021-12-28 | 23.20 | True | 138.0 | True |
| 2021-12-29 | 36.60 | True | 171.0 | True |
| 2021-12-30 | 39.40 | True | 176.0 | True |
| 2021-12-31 | 34.70 | True | 168.0 | True |
Most frequently occurring
| | q | q_flag | w | w_flag | # duplicates |
|---|---|---|---|---|---|
| 5095 | NaN | False | NaN | False | 1353 |
| 2503 | 7.78 | False | 105.0 | False | 103 |
| 2178 | 6.86 | False | 102.0 | False | 98 |
| 2890 | 9.09 | False | 109.0 | False | 98 |
| 2393 | 7.45 | False | 104.0 | False | 97 |
| 2603 | 8.09 | False | 106.0 | False | 96 |
| 2286 | 7.16 | False | 103.0 | False | 95 |
| 1830 | 5.92 | False | 99.0 | False | 93 |
| 2798 | 8.77 | False | 108.0 | False | 92 |
| 1504 | 5.10 | False | 92.0 | False | 90 |